The problem of proving generalization bounds for the performance of learning algorithms
can be formulated as a problem of bounding the bias and variance of estimators of the
expected error. We show how various stability assumptions can be employed for this
purpose. We provide a necessary and sufficient stability condition for bounding the bias
and variance for the Empirical Risk Minimization algorithm, and various sufficient
conditions for bounding bias and variance of estimators for general algorithms. We discuss
settings in which it is possible to obtain exponential bounds, and we prove an extension
of the bounded-difference inequality for “almost always” stable algorithms.

Keywords:  Stability;  generalization;  estimators;  empirical  risk  minimization. 
1. Introduction

One of the central problems of Statistical Learning Theory is to quantify the
generalization ability of learning algorithms within a probabilistic framework. The standard
setting for the problem is the following. Let $\mathcal{F}$ be a class of real-valued functions on
a space $\mathcal{X}$, mapping $\mathcal{X}$ into $\mathcal{Y} \subset \mathbb{R}$. Denote by $\mu$ an unknown probability measure
on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. An algorithm $\mathcal{A}$ is a mapping $\mathcal{A} : \mathcal{Z}^n \to \mathcal{F}$, $n \in \mathbb{Z}_+$. In plain words,
a learning algorithm observes $n$ input-output pairs and produces (learns) a function
which describes well the underlying input-output process. Throughout this article,
we will focus on symmetric algorithms, i.e. $\mathcal{A}(z_1,\dots,z_n) = \mathcal{A}(\pi(z_1,\dots,z_n))$ for any
permutation $\pi \in S_n$, the symmetric group.

Let $\mathcal{A}(z_1,\dots,z_n; x)$ denote the evaluation of the function $\mathcal{A}(z_1,\dots,z_n)$ at a
point $x$. To measure the quality of $\mathcal{A}(z_1,\dots,z_n)$, a loss function $\ell : \mathcal{F} \times \mathcal{Z} \to [0, M]$
is introduced, such that $\ell(\mathcal{A}(z_1,\dots,z_n); z)$ is a measure of how well $\mathcal{A}(z_1,\dots,z_n)$
predicts $y$ at point $x$, where $(x, y) = z \in \mathcal{Z}$. The function $\ell$ is often taken to be the
square loss.

If the algorithm $\mathcal{A}$ is clear from the context, we will write $\ell(z_1,\dots,z_n; z)$ instead
of $\ell(\mathcal{A}(z_1,\dots,z_n); z)$. The functions $\ell(f; \cdot)$ are called the loss functions and the class
$\mathcal{L}(\mathcal{F}) = \{\ell(f; \cdot) : f \in \mathcal{F}\}$ is called the loss class.

The main quantity of interest is the expected error of the function $\mathcal{A}(z_1,\dots,z_n)$,

$$I_{\mathrm{exp}}(z_1,\dots,z_n) := \mathbb{E}_{z\sim\mu}[\ell(z_1,\dots,z_n; z)] = \int \ell(z_1,\dots,z_n; z)\, d\mu(z).$$


This quantity measures the accuracy of $\mathcal{A}(z_1,\dots,z_n)$ on the unseen data $z$ drawn
from $\mu$. Unfortunately, the measure $\mu$ is unknown and this quantity cannot be
computed. The key assumption made in Statistical Learning Theory is that
the observed sample $z_1,\dots,z_n$ is independent and identically distributed (i.i.d.)
with the generating distribution $\mu$. The problem thus is to estimate $I_{\mathrm{exp}}(z_1,\dots,z_n)$
based on the finite sample $z_1,\dots,z_n$.

Although the expected error is unknown, several important quantities can be
computed from the sample. The first one is the empirical error (or resubstitution
estimate),

$$I_{\mathrm{emp}}(z_1,\dots,z_n) := \frac{1}{n}\sum_{i=1}^{n} \ell(z_1,\dots,z_n; z_i).$$

The second one is the leave-one-out error (or deleted estimate),

$$I_{\mathrm{loo}}(z_1,\dots,z_n) := \frac{1}{n}\sum_{i=1}^{n} \ell(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n; z_i),$$

where it is understood that the first term in the sum is $\ell(z_2,\dots,z_n; z_1)$ and the last
term is $\ell(z_1,\dots,z_{n-1}; z_n)$.

These quantities are employed to estimate the expected error, and Statistical
Learning Theory is concerned with providing bounds on the deviations of these
estimates from the expected error. Denote these deviations

$$\Psi_n(z_1,\dots,z_n) := I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{emp}}(z_1,\dots,z_n),$$

$$\Phi_n(z_1,\dots,z_n) := I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{loo}}(z_1,\dots,z_n).$$

If one can show that $\Psi_n$ (or $\Phi_n$) is “small”, then the empirical error (resp.
leave-one-out error) is a good proxy for the expected error. In particular, we are interested in
the rate of the convergence of $\Psi_n$ and $\Phi_n$ to zero as $n$ increases. Such statements,
of course, have to be made in probability.
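The two sample-based estimates can be computed for any symmetric algorithm. The following is a minimal sketch, assuming a toy algorithm (`mean_predictor`, which predicts the average training label) and the square loss; both names are illustrative choices and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]                 # z = (x, y)
Hypothesis = Callable[[float], float]       # f : X -> Y


def square_loss(f: Hypothesis, z: Point) -> float:
    """Square loss of hypothesis f at the point z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2


def mean_predictor(sample: Sequence[Point]) -> Hypothesis:
    """Hypothetical symmetric algorithm: predict the mean training label."""
    avg = sum(y for _, y in sample) / len(sample)
    return lambda x: avg


def empirical_error(algorithm, loss, sample) -> float:
    """I_emp: average loss of the learned function on its own training points."""
    f = algorithm(sample)
    return sum(loss(f, z) for z in sample) / len(sample)


def leave_one_out_error(algorithm, loss, sample) -> float:
    """I_loo: for each i, train without z_i and evaluate the loss at z_i."""
    total = 0.0
    for i, held_out in enumerate(sample):
        rest = list(sample[:i]) + list(sample[i + 1:])
        total += loss(algorithm(rest), held_out)
    return total / len(sample)


sample = [(0.0, 0.0), (1.0, 1.0)]
emp = empirical_error(mean_predictor, square_loss, sample)
loo = leave_one_out_error(mean_predictor, square_loss, sample)
print(emp, loo)
```

On this two-point sample the resubstitution estimate is smaller than the deleted estimate, since each point influences the function it is evaluated on.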

Let us first focus on the random variable $\Psi_n(z_1,\dots,z_n)$. Recall that the Central
Limit Theorem (CLT) guarantees that the average of $n$ i.i.d. random variables
converges to their mean (under the assumption of finiteness of second moment).
Unfortunately, the random variables $\ell(z_1,\dots,z_n; z_1),\dots,\ell(z_1,\dots,z_n; z_n)$ are dependent,
and the CLT is not applicable. In fact, the interdependence of these random
variables makes the resubstitution estimate positively biased, as the next example shows.


Example 1.1. Let $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$, $\mu(x) = U[0, 1]$, $\mu(y|x) = \delta_{y=1}$, $\ell(y, y') =
|y - y'|$, and $\mathcal{A}$ is defined as $\mathcal{A}(z_1,\dots,z_n; x) = 1$ if $x \in \{x_1,\dots,x_n\}$ and 0 otherwise.
In other words, the algorithm observes $n$ data points $(x, 1)$, where $x$ is distributed
uniformly on $[0, 1]$, and generates a hypothesis which fits exactly the observed data,
but outputs 0 for unseen points $x$. The empirical error of $\mathcal{A}$ is 0, while the expected
error is 1, i.e. $\Psi_n(z_1,\dots,z_n) = 1$ for any $z_1,\dots,z_n$.
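Example 1.1 can be simulated directly. The sketch below is illustrative only: the expected error is approximated on a fresh sample of 10,000 points, and the names `memorizer` and `abs_loss` are assumptions introduced here.

```python
import random


def memorizer(sample):
    """Algorithm of Example 1.1: output 1 on training inputs, 0 elsewhere."""
    seen = {x for x, _ in sample}
    return lambda x: 1 if x in seen else 0


def abs_loss(f, z):
    """Loss |f(x) - y| at the point z = (x, y)."""
    x, y = z
    return abs(f(x) - y)


rng = random.Random(0)
n = 50
# Every label equals 1 and inputs are uniform on [0, 1].
train = [(rng.random(), 1) for _ in range(n)]
f = memorizer(train)

# Resubstitution estimate: the training points are fit exactly.
empirical = sum(abs_loss(f, z) for z in train) / n

# Monte Carlo stand-in for the expected error on unseen data.
fresh = [(rng.random(), 1) for _ in range(10_000)]
expected_estimate = sum(abs_loss(f, z) for z in fresh) / len(fresh)

defect = expected_estimate - empirical
print(empirical, expected_estimate, defect)
```

A fresh uniform draw coincides with a training input with probability zero, so the estimated defect stays near 1 regardless of `n`.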


The algorithm in Example 1.1 is the Empirical Risk Minimization (ERM)
algorithm

$$\mathcal{A}(z_1,\dots,z_n) = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \ell(f; z_i).$$

In the above example, the function class is $\mathcal{F} = \bigcup_{n \ge 1} \{f_{\mathbf{x}} : \mathbf{x} = (x_1,\dots,x_n) \in [0, 1]^n\}$,
where $f_{\mathbf{x}}(x) = 1$ if $x = x_i$ for some $1 \le i \le n$ and $f_{\mathbf{x}}(x) = 0$ otherwise.

Though  an  exact  minimizer  of  empirical  risk  might  not  exist,  an  almost- 
minimizer  always  exists.  The  results  of  this  paper  hold  for  almost-minimizers,  but, 
for  the  sake  of  clarity,  we  consider  exact  minimization. 

Minimizing the empirical error is a natural idea, as long as guarantees on
smallness of $\Psi_n(z_1,\dots,z_n)$ can be made. Note that no such guarantee can be made in
Example 1.1. Intuitively, this is due to the fact that the algorithm can fit any data,
i.e. the space of functions $\mathcal{L}(\mathcal{F})$ is too large. Indeed, convergence of empirical errors
to the expected errors is completely characterized by the “size” of $\mathcal{L}(\mathcal{F})$. Such a
characterization disregards the algorithm $\mathcal{A}$, and only focuses on the loss function
$\ell$ and the class $\mathcal{F}$, from which the functions are chosen. The class $\mathcal{L}(\mathcal{F})$ is called
uniform Glivenko-Cantelli if for every $\varepsilon > 0$,

$$\lim_{n\to\infty} \sup_{\mu} \mathbb{P}\left( \sup_{\ell \in \mathcal{L}(\mathcal{F})} \left| \mathbb{E}\,\ell(z) - \frac{1}{n}\sum_{i=1}^{n} \ell(z_i) \right| > \varepsilon \right) = 0,$$


where $z_1,\dots,z_n$ are i.i.d. random variables distributed according to $\mu$.
Non-asymptotic results of the form

$$\mathbb{P}\left( \sup_{\ell \in \mathcal{L}(\mathcal{F})} \left| \mathbb{E}\,\ell(z) - \frac{1}{n}\sum_{i=1}^{n} \ell(z_i) \right| > \varepsilon \right) < \delta(\varepsilon, n, \mathcal{L}(\mathcal{F}))$$

give uniform (over class $\mathcal{L}(\mathcal{F})$) rates of convergence of empirical means to the
expected means. Since the guarantee is given for all functions in the class, the
defect $\Psi_n(z_1,\dots,z_n)$ is bounded by $\varepsilon$ with probability $1 - \delta(\varepsilon, n, \mathcal{L}(\mathcal{F}))$, no matter
what the algorithm is.

Albeit interesting from the theoretical point of view, the uniform bounds are in
general loose, as they are “worst case” over all functions in the class. As an extreme
example, consider the algorithm that always outputs the same function (the constant
algorithm)

$$\mathcal{A}(z_1,\dots,z_n) = f_0, \quad \forall (z_1,\dots,z_n) \in \mathcal{Z}^n.$$

The bound on $\Psi_n(z_1,\dots,z_n)$ follows from the CLT, and an analysis based upon the
complexity of the class $\mathcal{F}$ does not make sense. Recent advances in Statistical
Learning Theory shift the focus from uniform bounds to non-uniform bounds of
the form


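$$\mathbb{P}(|\Psi_n(z_1,\dots,z_n)| > \varepsilon) < \delta(\varepsilon, n, \mathcal{A}), \tag{1.1}$$

or

$$\mathbb{P}(|\Psi_n(z_1,\dots,z_n)| > \varepsilon(\delta, n, \mathcal{A}, z_1,\dots,z_n)) < \delta,$$

where in the last bound $\varepsilon$ depends on the sample. In this article, we will focus on the
bounds of type (1.1). The goal is to derive bounds on $\Psi_n$ (or $\Phi_n$) such that
$\lim_{n\to\infty} \delta(\varepsilon, n, \mathcal{A}) = 0$ for any fixed $\varepsilon > 0$. If the rate of decrease of $\delta(\varepsilon, n, \mathcal{A})$ is
not important, we will write $|\Psi_n| \xrightarrow{P} 0$ and $|\Phi_n| \xrightarrow{P} 0$.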

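Notice that $\Psi_n$ and $\Phi_n$ are bounded random variables, as the loss function
$\ell$ takes values in $[0, M]$. By Markov’s inequality,

$$\forall \varepsilon > 0, \quad \mathbb{P}(|\Psi_n| \ge \varepsilon) \le \frac{\mathbb{E}|\Psi_n|}{\varepsilon},$$

and also,

$$\forall \varepsilon' > 0, \quad \mathbb{E}|\Psi_n| \le M\,\mathbb{P}(|\Psi_n| \ge \varepsilon') + \varepsilon'.$$

Therefore, showing $|\Psi_n| \xrightarrow{P} 0$ is equivalent to showing $\mathbb{E}|\Psi_n| \to 0$. The latter is
equivalent to $\mathbb{E}\Psi_n^2 \to 0$ since $|\Psi_n| \le M$. Further, notice that $\mathbb{E}\Psi_n^2 = \mathrm{var}(\Psi_n) +
(\mathbb{E}\Psi_n)^2$. We will call $\mathbb{E}\Psi_n$ the bias, $\mathrm{var}(\Psi_n)$ the variance, and $\mathbb{E}\Psi_n^2$ the second
moment of $\Psi_n$. The same derivations and terminology hold for $\Phi_n$.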

We  have  shown  that  studying  conditions  for  convergence  in  probability  of  the 
estimators  to  zero  is  equivalent  to  studying  their  mean  and  variance  (or  the  second 
moment  alone). 
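For completeness, the two elementary steps behind this equivalence can be written out. This is a sketch of the standard argument, writing $\Psi_n$ for the deviation of the empirical error from the expected error and using only $|\Psi_n| \le M$; the indicator notation $\mathbf{1}\{\cdot\}$ is introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}|\Psi_n|
  &= \mathbb{E}\bigl[|\Psi_n|\,\mathbf{1}\{|\Psi_n| \ge \varepsilon'\}\bigr]
   + \mathbb{E}\bigl[|\Psi_n|\,\mathbf{1}\{|\Psi_n| < \varepsilon'\}\bigr] \\
  &\le M\,\mathbb{P}(|\Psi_n| \ge \varepsilon') + \varepsilon', \\[4pt]
\bigl(\mathbb{E}|\Psi_n|\bigr)^2
  &\le \mathbb{E}\Psi_n^2 \le M\,\mathbb{E}|\Psi_n| .
\end{align*}
```

The first display shows that convergence in probability forces $\mathbb{E}|\Psi_n| \to 0$ (let $n \to \infty$, then $\varepsilon' \to 0$); the second shows that the first absolute moment and the second moment vanish together.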

In this paper, we consider various stability conditions which allow one to bound
bias and variance or the second moment, and thus imply convergence of $\Psi_n$ and
$\Phi_n$ to zero in probability. Though the reader should expect a number of definitions
of stability, the common flavor of these notions is the comparison of the “behavior”
of the algorithm $\mathcal{A}$ on similar samples. We hope that the present work sheds light
on the important stability aspects of algorithms, suggesting principles for designing
predictive learning systems.
We now sketch the organization of this paper. In Sec. 2, we motivate the use of
stability and give some historical background. In Sec. 3, we show how bias (Sec. 3.1)
and variance (Sec. 3.2) can be bounded by various stability quantities. Sometimes
it is mathematically more convenient to bound the second moment instead of bias
and variance, and this is done in Sec. 4. In particular, Sec. 4.1 deals with the
second moment $\mathbb{E}\Phi_n^2$ in the spirit of [4], while in Secs. 4.3 and 4.2, we bound $\mathbb{E}\Psi_n^2$ in
the spirit of [10] and [2], respectively. The goal of Secs. 4.1 and 4.2 is to re-derive
some known results in a simple manner that allows one to compare the proofs side
by side. The results of these sections hold for general algorithms. Furthermore,
for specific algorithms the results can be improved, i.e. simpler quantities might
govern the convergence of the estimators to zero. To illustrate this, in Sec. 4.4 we
prove that for the Empirical Risk Minimization algorithm, a bound on the bias
$\mathbb{E}\Psi_n$ implies a bound on the second moment $\mathbb{E}\Psi_n^2$. We therefore provide a simple
necessary and sufficient condition for consistency of ERM. If rates of convergence
are of importance, rather than using Markov’s inequality, one can make use of more
sophisticated concentration inequalities at the cost of requiring more stringent
stability conditions. In Sec. 5, we discuss the most rigid stability, Uniform Stability,
and provide exponential bounds in the spirit of [2]. In Sec. 5.2, we consider less rigid
notions of stability and prove exponential inequalities based on powerful moment
inequalities of [1]. Finally, Sec. 6 summarizes the paper and discusses further
directions and open questions.


2.  Historical  Remarks  and  Motivation 

Devroye, Rogers and Wagner (see, e.g., [4]) were the first, to our knowledge, to
observe that sensitivity of the algorithms with regard to small changes in the
sample is related to the behavior of the leave-one-out estimate. The authors were able to
obtain results for the k-Nearest-Neighbor algorithm, where VC theory fails because
of the large class of potential hypotheses. These results were further extended for k-local
algorithms and for potential learning rules. Kearns and Ron [6] later discovered a
connection between finite VC-dimension and stability. Bousquet and Elisseeff [2]
showed that a large class of learning algorithms, based on Tikhonov
Regularization, is stable in a very strong sense, which allowed the authors to obtain
exponential bounds without much work. Kutin and Niyogi [8] introduced a number of
notions of stability and showed implications between them. The authors emphasized
the importance of “almost-everywhere” stability and proved valuable extensions of
McDiarmid’s exponential inequality [7]. Mukherjee et al. [10] proved that a
combination of three stability notions is sufficient to bound the difference between the
empirical estimate and the expected error, while for Empirical Risk Minimization
these notions are necessary and sufficient. The latter result showed an alternative to
the VC theory condition for consistency of Empirical Risk Minimization. In this paper,
we prove, in a unified framework, some of the important results mentioned above,
as well as show new ways of incorporating stability notions in Learning Theory.
We now give some intuition for using algorithmic stability. First, note that
without any assumptions on the algorithm, nothing can be said about the mean
and the variance of $\Psi_n$. One can easily come up with settings when the mean is
converging to zero, but not the variance, or vice versa (e.g., Example 1.1), or both
quantities diverge from zero.

The assumptions of this paper that allow us to bound the mean and the
variance of $\Psi_n$ and $\Phi_n$ are loosely termed stability assumptions. Recall that if the
algorithm is a constant algorithm, $\Psi_n$ is bounded by the Central Limit Theorem.
Of course, this is an extreme and the most “stable” case. It turns out that the
“constancy” assumption on the algorithm can be relaxed while still achieving tight
bounds. A central notion here is that of Uniform Stability [2]:

Definition 2.1. Uniform Stability $\beta_\infty(n)$ of an algorithm $\mathcal{A}$ is

$$\beta_\infty(n) := \sup_{z_1,\dots,z_n,\, z \in \mathcal{Z},\; x \in \mathcal{X}} |\mathcal{A}(z_1,\dots,z_n; x) - \mathcal{A}(z, z_2,\dots,z_n; x)|.$$

Intuitively, if $\beta_\infty(n) \to 0$, the algorithm resembles more and more the constant
algorithm when considered on similar samples (although it can produce distant
functions on different samples). It can be shown that some well-known algorithms
possess Uniform Stability with a certain rate on $\beta_\infty(n)$ (see [2] and Sec. 5.1).
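As a toy illustration of Definition 2.1, consider an algorithm that outputs the constant function equal to the average training label, with labels in [0, 1]. Replacing one point moves this average by at most 1/n, so its Uniform Stability is at most 1/n. The sketch below checks this numerically on random samples; the algorithm and all names are assumptions made for illustration and are not taken from [2].

```python
import random


def mean_predictor(sample):
    """Hypothetical algorithm: constant function equal to the mean label."""
    avg = sum(y for _, y in sample) / len(sample)
    return lambda x: avg


def replace_one_gap(algorithm, sample, new_point, x):
    """|A(z_1,...,z_n; x) - A(z, z_2,...,z_n; x)| for one replacement."""
    perturbed = [new_point] + list(sample[1:])
    return abs(algorithm(sample)(x) - algorithm(perturbed)(x))


rng = random.Random(1)
n = 20
worst = 0.0
for _ in range(2000):
    sample = [(rng.random(), rng.random()) for _ in range(n)]
    new_point = (rng.random(), rng.random())
    gap = replace_one_gap(mean_predictor, sample, new_point, x=0.5)
    worst = max(worst, gap)

# Random search only lower-bounds the supremum; the analytic bound is 1/n.
print(worst, 1 / n)
```

The random search explores only part of the supremum in the definition, so it is a sanity check rather than a proof of the 1/n rate.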

In the following sections, we will show how the bias and variance (or second
moment) can be upper-bounded or decomposed in terms of quantities over “similar”
samples. The advantage of this approach is that it allows one to check “stability” for
a specific algorithm and derive generalization bounds without much further work.
For instance, it is easy to show that the k-Nearest Neighbors algorithm is $L_1$-stable and
a generalization bound follows immediately (see Sec. 4.1).

3.  Bounding  Bias  and  Variance 
3.1.  Decomposing  the  bias 

The bias of the resubstitution estimate and the deleted estimate can be written as
quantities over similar samples:

$$\mathbb{E}\Psi_n = \mathbb{E}[\ell(z_1,\dots,z_n; z) - \ell(z_1,\dots,z_n; z_1)]
= \mathbb{E}[\ell(z, z_2,\dots,z_n; z_1) - \ell(z_1,\dots,z_n; z_1)].$$

The first equality above follows because $\mathbb{E}\ell(z_1,\dots,z_n; z_k) = \mathbb{E}\ell(z_1,\dots,z_n; z_m)$
for any $k, m$. The second equality holds by noticing that $\mathbb{E}\ell(z_1,\dots,z_n; z) =
\mathbb{E}\ell(z, z_2,\dots,z_n; z_1)$ because the roles of $z$ and $z_1$ can be switched. We will employ
this trick many times in the later proofs, and for convenience, we shall denote this
“renaming” process by $z \leftrightarrow z_1$.

Let us inspect the quantity $\mathbb{E}[\ell(z, z_2,\dots,z_n; z_1) - \ell(z_1,\dots,z_n; z_1)]$. It is the
average difference between the loss at a point $z_1$ when it is not present in the
learning sample (out-of-sample) and the loss at $z_1$ when it is present in the $n$-tuple
(in-sample). Hence, the bias $\mathbb{E}\Psi_n$ will decrease if and only if the average behavior
on in-sample and out-of-sample points is becoming more and more similar. This is
a stability property and we will give a name to it:

Definition 3.1. Average Stability $\beta_{\mathrm{bias}}(n)$ of an algorithm $\mathcal{A}$ is

$$\beta_{\mathrm{bias}}(n) := \mathbb{E}[\ell(z, z_2,\dots,z_n; z_1) - \ell(z_1,\dots,z_n; z_1)].$$
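As a worked instance, the Average Stability of the memorizing algorithm of Example 1.1 can be computed directly. The sketch below uses only the setup of that example (all labels equal 1, inputs uniform on $[0,1]$, loss $|y - y'|$); the symbol names follow Definition 3.1.

```latex
\begin{align*}
\ell(z, z_2, \dots, z_n; z_1) &= |0 - 1| = 1
  \quad \text{a.s., since } x_1 \notin \{x, x_2, \dots, x_n\} \text{ with probability one}, \\
\ell(z_1, \dots, z_n; z_1) &= |1 - 1| = 0, \\
\beta_{\mathrm{bias}}(n) &= \mathbb{E}[1 - 0] = 1 \quad \text{for every } n .
\end{align*}
```

The in-sample and out-of-sample losses never approach each other, which matches the constant bias of 1 observed in Example 1.1.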

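We now turn to the deleted estimate. The bias $\mathbb{E}\Phi_n$ can be written as

$$\mathbb{E}\Phi_n = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \bigl(\mathbb{E}_z\ell(z_1,\dots,z_n; z) - \ell(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n; z_i)\bigr)\right]$$
$$= \mathbb{E}[\ell(z_1,\dots,z_n; z) - \ell(z_2,\dots,z_n; z_1)]$$
$$= \mathbb{E}[I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{exp}}(z_2,\dots,z_n)].$$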


We will not give a name to this quantity, as it will not be used explicitly later. One
can see that the bias of the deleted estimate should be small for reasonable
algorithms. Unfortunately, the variance of the deleted estimate is large in general (see,
e.g., [5, p. 415]). The opposite is believed to be true for the resubstitution estimate.
We refer the reader to [5, Chaps. 23, 24 and 31] for more information. Surprisingly,
we will show in Sec. 4.4 that for Empirical Risk Minimization algorithms, if one
shows that the bias of the resubstitution estimate decreases, one also obtains that
the variance decreases.

3.2.  Bounding  the  variance 

Having shown a decomposition of the bias of $\Psi_n$ and $\Phi_n$ in terms of stability
conditions, we now show a simple way to bound the variance in terms of quantities
over “similar” samples.

Theorem 3.2 (Efron-Stein). Let $f : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function of
$n$ variables and define $\Gamma = f(z_1,\dots,z_n)$ and $\Gamma_i' = f(z_1,\dots,z_i',\dots,z_n)$, where
$z_1,\dots,z_n, z_1',\dots,z_n'$ are i.i.d. random variables. Then

$$\mathrm{var}(\Gamma) \le \frac{1}{2}\sum_{i=1}^{n} \mathbb{E}\bigl[(\Gamma - \Gamma_i')^2\bigr]. \tag{3.1}$$

A “removal” version of the above is the following:

Theorem 3.3 (Efron-Stein). Let $f : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function of $n$
variables and $g : \mathcal{Z}^{n-1} \to \mathbb{R}$ of $n - 1$ variables. Define $\Gamma = f(z_1,\dots,z_n)$ and
$\Gamma_i = g(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n)$, where $z_1,\dots,z_n$ are i.i.d. random variables. Then

$$\mathrm{var}(\Gamma) \le \sum_{i=1}^{n} \mathbb{E}\bigl[(\Gamma - \Gamma_i)^2\bigr]. \tag{3.2}$$
The idea of the proofs of the above results is based on the fact that $\mathrm{var}(\Gamma) \le
\mathbb{E}(\Gamma - c)^2$ for any constant $c$, and so

$$\mathrm{var}_i(\Gamma) = \mathbb{E}_{z_i}(\Gamma - \mathbb{E}_{z_i}\Gamma)^2 \le \mathbb{E}_{z_i}(\Gamma - \Gamma_i)^2.$$

Thus, we artificially introduce a quantity over a “similar” sample to upper-bound
the variance. If the increments $\Gamma - \Gamma_i$ and $\Gamma - \Gamma_i'$ are small, the variance is small.
When applied to the function $\Psi_n(z_1,\dots,z_n)$, this translates exactly into controlling
the behavior of $\mathcal{A}$ on similar samples:

$$\mathrm{var}(\Psi_n) \le n\,\mathbb{E}\bigl(\Psi_n(z_1,\dots,z_n) - \Psi_n(z_2,\dots,z_n)\bigr)^2$$
$$\le 2n\,\mathbb{E}\bigl(I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{exp}}(z_2,\dots,z_n)\bigr)^2
+ 2n\,\mathbb{E}\bigl(I_{\mathrm{emp}}(z_2,\dots,z_n) - I_{\mathrm{emp}}(z_1,\dots,z_n)\bigr)^2.$$

Here we used the fact that the algorithm is invariant under permutation of
coordinates, and therefore all the terms in the sum of (3.2) are equal. This symmetry will
be exploited to a great extent in the later sections. Note that similar results can be
obtained using the “replacement” version of Efron-Stein’s bound.
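The “replacement” Efron-Stein bound can be checked by simulation. The sketch below is a Monte Carlo illustration under assumptions chosen here: the statistic is the maximum of n = 5 independent uniform variables, and both sides of the inequality are estimated from 20,000 trials. The function name `efron_stein_check` is hypothetical.

```python
import random
import statistics


def efron_stein_check(f, n, trials, rng):
    """Estimate var(f(z)) and the replacement bound (1/2) * sum_i E(f - f_i')^2."""
    values = []
    squared_increments = 0.0
    for _ in range(trials):
        z = [rng.random() for _ in range(n)]
        gamma = f(z)
        values.append(gamma)
        for i in range(n):
            replaced = list(z)
            replaced[i] = rng.random()      # independent copy z_i'
            squared_increments += (gamma - f(replaced)) ** 2
    variance = statistics.pvariance(values)
    bound = 0.5 * squared_increments / trials
    return variance, bound


variance, bound = efron_stein_check(max, n=5, trials=20_000, rng=random.Random(0))
print(variance, bound)
```

For the sample mean the two sides coincide exactly, so a nonlinear statistic such as the maximum gives a more informative comparison.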

The meaning of the above bound is that if the mean square of the difference
between expected errors of functions, learned from samples differing in one point,
is decreasing faster than $n^{-1}$, and if the same holds for the empirical errors, then
the variance of the resubstitution estimate is decreasing. Let us give names to the
above quantities.

Definition 3.4. Empirical-Error (Removal) Stability of an algorithm $\mathcal{A}$ is

$$\beta_{\mathrm{emp}}^2(n) := \mathbb{E}\,|I_{\mathrm{emp}}(z_1,\dots,z_n) - I_{\mathrm{emp}}(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n)|^2.$$

Definition 3.5. Expected-Error (Removal) Stability of an algorithm $\mathcal{A}$ is

$$\beta_{\mathrm{exp}}^2(n) := \mathbb{E}\,|I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{exp}}(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n)|^2.$$

With the above definitions, the following theorem follows:

Theorem 3.6.

$$\mathrm{var}(\Psi_n) \le 2n\bigl(\beta_{\mathrm{exp}}^2(n) + \beta_{\mathrm{emp}}^2(n)\bigr).$$

The following example shows that the ERM algorithm is always Empirical-Error
Stable with $\beta_{\mathrm{emp}}(n) \le M(n-1)^{-1}$. We deduce that $\Psi_n \xrightarrow{P} 0$ for ERM whenever
$\beta_{\mathrm{exp}} = o(n^{-1/2})$. As we will show in Sec. 4.4, the decay of the Average Stability,
$\beta_{\mathrm{bias}}(n) = o(1)$, is both necessary and sufficient for $\Psi_n \xrightarrow{P} 0$ for ERM.
Example 3.7. For an Empirical Risk Minimization algorithm, $\beta_{\mathrm{emp}}(n) \le \frac{M}{n-1}$.
Indeed, since $\mathcal{A}(z_2,\dots,z_n)$ minimizes the empirical error on $z_2,\dots,z_n$,

$$I_{\mathrm{emp}}(z_2,\dots,z_n) - I_{\mathrm{emp}}(z_1,\dots,z_n)$$
$$\le \frac{1}{n-1}\sum_{i=2}^{n} \ell(z_1,\dots,z_n; z_i) - \frac{1}{n}\sum_{i=1}^{n} \ell(z_1,\dots,z_n; z_i)$$
$$= \frac{1}{n(n-1)}\sum_{i=2}^{n} \ell(z_1,\dots,z_n; z_i) - \frac{1}{n}\,\ell(z_1,\dots,z_n; z_1)$$
$$\le \frac{M}{n} \le \frac{M}{n-1},$$

and the other direction is proved similarly.
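The claim of Example 3.7 can be checked numerically for ERM over a small finite class. The sketch below is an illustration under assumptions made here: the class consists of eleven threshold classifiers on [0, 1], the loss is the 0-1 loss (so M = 1), and labels are generated with 20% noise. None of these choices come from the paper.

```python
import random

THRESHOLDS = [i / 10 for i in range(11)]   # hypothetical finite function class


def zero_one_loss(t, z):
    """0-1 loss of the threshold classifier x -> 1{x >= t} at z = (x, y)."""
    x, y = z
    return 0 if int(x >= t) == y else 1


def empirical_risk(t, sample):
    return sum(zero_one_loss(t, z) for z in sample) / len(sample)


def erm_empirical_error(sample):
    """I_emp for ERM: the smallest empirical risk achieved within the class."""
    return min(empirical_risk(t, sample) for t in THRESHOLDS)


def draw(rng):
    """One noisy labeled point; the true threshold 0.4 is an arbitrary choice."""
    x = rng.random()
    y = int(x >= 0.4)
    if rng.random() < 0.2:
        y = 1 - y
    return (x, y)


rng = random.Random(2)
n = 30
max_gap = 0.0
for _ in range(500):
    sample = [draw(rng) for _ in range(n)]
    # Compare ERM on z_1,...,z_n with ERM on z_2,...,z_n.
    gap = abs(erm_empirical_error(sample[1:]) - erm_empirical_error(sample))
    max_gap = max(max_gap, gap)

print(max_gap, 1 / (n - 1))
```

Because the inequality holds for every sample, the observed maximum gap never exceeds M/(n - 1), whatever the seed.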


We will show in the following sections that a direct study of the second moment
leads to better bounds. For the bound on the variance in Theorem 3.6 to decrease,
$\beta_{\mathrm{exp}}$ and $\beta_{\mathrm{emp}}$ have to be $o(n^{-1/2})$. With an additional assumption, we will be able
to remove the factor $n$ by upper-bounding the second moment and by exploiting
the structure of the random variables $\Phi_n$ and $\Psi_n$.


4.  Bounding  the  2nd  Moment 

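Instead of bounding the mean and variance of the estimators, we can bound the
second moment. The reason for doing so is mathematical convenience and is
due to the following straightforward bounds on the second moment:

$$\mathbb{E}\Psi_n^2 = \mathbb{E}\bigl[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\bigr]^2 - \mathbb{E}\left[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1,\dots,z_n; z_i)\right]$$
$$\quad + \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(z_1,\dots,z_n; z_i)\right]^2 - \mathbb{E}\left[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1,\dots,z_n; z_i)\right]$$
$$\le \mathbb{E}\bigl[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\mathbb{E}_{z'}\ell(z_1,\dots,z_n; z') - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_1)\bigr]$$
$$\quad + \mathbb{E}\bigl[\ell(z_1,\dots,z_n; z_1)\,\ell(z_1,\dots,z_n; z_2) - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_1)\bigr]$$
$$\quad + \frac{1}{n}\,\mathbb{E}\,\ell(z_1,\dots,z_n; z_1)^2,$$

and the last term is bounded by $\frac{M^2}{n}$. Similarly,

$$\mathbb{E}\Phi_n^2 \le \mathbb{E}\bigl[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\mathbb{E}_{z'}\ell(z_1,\dots,z_n; z') - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z_1)\bigr]$$
$$\quad + \mathbb{E}\bigl[\ell(z_2,\dots,z_n; z_1)\,\ell(z_1, z_3,\dots,z_n; z_2) - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z_1)\bigr]$$
$$\quad + \frac{1}{n}\,\mathbb{E}\,\ell(z_2,\dots,z_n; z_1)^2,$$

and the last term is bounded by $\frac{M^2}{n}$.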
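The only step in these bounds that is not pure algebra is the treatment of the squared empirical mean. A short sketch of that step, using the shorthand $\ell_i := \ell(z_1,\dots,z_n; z_i)$ (introduced here for brevity), the symmetry of the algorithm, and $0 \le \ell \le M$:

```latex
\begin{align*}
\mathbb{E}\Bigl[\frac{1}{n}\sum_{i=1}^{n}\ell_i\Bigr]^2
  &= \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{E}\,\ell_i^2
   + \frac{1}{n^2}\sum_{i \neq j}\mathbb{E}\,\ell_i\ell_j \\
  &= \frac{1}{n}\,\mathbb{E}\,\ell_1^2 + \frac{n-1}{n}\,\mathbb{E}\,\ell_1\ell_2 \\
  &\le \frac{M^2}{n} + \mathbb{E}\,\ell_1\ell_2 .
\end{align*}
```

The diagonal terms contribute the $M^2/n$ remainder, while the off-diagonal terms reduce, by symmetry, to a single product of losses at two distinct sample points.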
In the proofs, we will use the following inequality for random variables $X$, $X'$
and $Y$:

$$\mathbb{E}[XY - X'Y] \le M\,\mathbb{E}|X - X'| \tag{4.1}$$

if $-M \le Y \le M$. The bounds on the second moments are already sums of terms of
the type “$\mathbb{E}[XY - WZ]$”, and we will find a way to use symmetry to change these
terms into the type “$\mathbb{E}[XY - X'Y]$”, where $X$ and $X'$ will be quantities over similar
samples, and so $\mathbb{E}|X - X'|$ will be bounded by a certain stability of the algorithm.
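Inequality (4.1) follows in one line from the boundedness of $Y$; the derivation is written out here for reference, with $X$, $X'$ and $Y$ as in the statement above.

```latex
\begin{align*}
\mathbb{E}[XY - X'Y]
  = \mathbb{E}[(X - X')\,Y]
  \le \mathbb{E}\bigl[|X - X'|\,|Y|\bigr]
  \le M\,\mathbb{E}|X - X'| .
\end{align*}
```

In the proofs, $Y$ is always a loss value in $[0, M]$, so the hypothesis $-M \le Y \le M$ holds automatically.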

4.1.  Leave-one-out  (deleted)  estimate 

We have seen that $\mathbb{E}\Phi_n = \mathbb{E}[I_{\mathrm{exp}}(z_1,\dots,z_n) - I_{\mathrm{exp}}(z_2,\dots,z_n)]$ and thus the bias
decreases if and only if the expected errors are similar when learning on
similar (one additional point) samples. Moreover, intuitively, these errors have to
occur at the same places because otherwise evaluation of leave-one-out functions
$\ell(z_1,\dots,z_{i-1},z_{i+1},\dots,z_n; z)$ will not tell us about $\ell(z_1,\dots,z_n; z)$. This implies
that the $L_1$ distance between the functions on similar (one additional point)
samples should be small. This connection between $L_1$ stability and the leave-one-out
estimate has been observed by Devroye and Wagner [4] and further studied in [6].
We now define this stability notion:

Definition 4.1. $L_1$-Stability of an algorithm $\mathcal{A}$ is

$$\beta_1(n) := \|\ell(z_1,\dots,z_n; \cdot) - \ell(z_2,\dots,z_n; \cdot)\|_{L_1(\mu)}
= \mathbb{E}_z|\ell(z_1,\dots,z_n; z) - \ell(z_2,\dots,z_n; z)|.$$

The following theorem is proved in [4,5] for classification algorithms. We give
a similar proof for general learning algorithms. The result shows that the second
moment (and therefore, both bias and variance) of the leave-one-out error estimate
is bounded by the $L_1$ distance between loss functions on similar samples.

Theorem 4.2.

$$\mathbb{E}\Phi_n^2 \le M\bigl(2\beta_1(n-1) + 4\beta_1(n)\bigr) + \frac{M^2}{n}.$$

Proof. The first term in the decomposition of the second moment $\mathbb{E}\Phi_n^2$ can be
bounded as follows:

$$\mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z') - \ell(z_1,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z_1)]$$
$$= \mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z')]$$
$$= \mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z') - \ell(z_2,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z')]$$
$$\quad + \mathbb{E}[\ell(z_2,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z')]$$
$$\quad + \mathbb{E}[\ell(z', z_2,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z')]$$
$$\le 3M\beta_1(n).$$

The first equality holds by renaming $z' \leftrightarrow z_1$. In doing this, we are using the fact
that all the variables $z_1,\dots,z_n, z, z'$ are identically distributed and independent.
To obtain the inequality above, note that each of the three terms is bounded (using
(4.1)) by $M\beta_1(n)$.

The second term in the decomposition is bounded similarly:

$$\mathbb{E}[\ell(z_2,\dots,z_n; z_1)\,\ell(z_1, z_3,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z_1)]$$
$$= \mathbb{E}[\ell(z', z_3,\dots,z_n; z)\,\ell(z, z_3,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z')]$$
$$= \mathbb{E}[\ell(z', z_3,\dots,z_n; z)\,\ell(z, z_3,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z, z_3,\dots,z_n; z')]$$
$$\quad + \mathbb{E}[\ell(z', z_2,\dots,z_n; z)\,\ell(z, z_3,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_3,\dots,z_n; z')]$$
$$\quad + \mathbb{E}[\ell(z', z_2,\dots,z_n; z)\,\ell(z_3,\dots,z_n; z') - \ell(z', z_2,\dots,z_n; z)\,\ell(z_2,\dots,z_n; z')]$$
$$\le M\beta_1(n) + 2M\beta_1(n-1).$$

The first equality follows by renaming $z_2 \leftrightarrow z'$ as well as $z_1 \leftrightarrow z$ in the first term,
and $z_1 \leftrightarrow z'$ in the second term. Finally, we bound the last term by $M^2/n$ to obtain
the result. □

4.2.  Empirical  error  (resubstitution)  estimate:  replacement  case 

Recall that the bias of the resubstitution estimate is the Average Stability, $\mathbb{E}\Psi_n =
\beta_{\mathrm{bias}}$. However, this is not enough to bound the second moment $\mathbb{E}\Psi_n^2$ for general
algorithms. Nevertheless, $\beta_{\mathrm{bias}}$ measures the average performance of in-sample and
out-of-sample errors and this is inherently linked to the closeness of the
resubstitution (in-sample) estimate and the expected error (out-of-sample performance). It
turns out that it is possible to derive bounds on $\mathbb{E}\Psi_n^2$ by using a stronger version of
the Average Stability. The natural strengthening is requiring that not only the first,
but also the second moment of $\ell(z_1,\dots,z_n; z_i) - \ell(z_1,\dots,z_i',\dots,z_n; z_i)$ is decaying
to 0. We follow [8] in calling this type of stability Cross-Validation (CV) Stability:

Definition 4.3. CV (Replacement) Stability of an algorithm $\mathcal{A}$ is

$$\beta_{\mathrm{cvr}}(n) := \mathbb{E}\,|\ell(z_1,\dots,z_n; z_1) - \ell(z, z_2,\dots,z_n; z_1)|,$$

where the expectation is over a draw of $n + 1$ points.

The following theorem was proven in [2]. Here, we give a version of the proof.

Theorem 4.4.

$$\mathbb{E}\Psi_n^2 \le 6M\beta_{\mathrm{cvr}}(n) + \frac{M^2}{n}.$$

Proof. The first term in the decomposition of $\mathbb{E}\Psi_n^2$ can be bounded as follows:

$$\mathbb{E}[\mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\mathbb{E}_{z'}\ell(z_1,\dots,z_n; z') - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$= \mathbb{E}[\ell(z_1, z', z_3,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$= \mathbb{E}[\ell(z_1, z', z_3,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2) - \ell(z_1, z, z_3,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2)]$$
$$\quad + \mathbb{E}[\ell(z_1, z, z_3,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2)]$$
$$\quad + \mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z_1, z', z_3,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$\le 3M\beta_{\mathrm{cvr}}(n).$$

The first equality follows from renaming $z_2 \leftrightarrow z'$ in the first term. Each of the three
terms in the sum above is bounded by $M\beta_{\mathrm{cvr}}(n)$.

The second term in the decomposition of $\mathbb{E}\Psi_n^2$ can be bounded as follows:

$$\mathbb{E}[\ell(z_1,\dots,z_n; z_1)\,\ell(z_1,\dots,z_n; z_2) - \mathbb{E}_z\ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$= \mathbb{E}[\ell(z, z_2,\dots,z_n; z)\,\ell(z, z_2,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$= \mathbb{E}[\ell(z, z_2,\dots,z_n; z)\,\ell(z, z_2,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z, z_2,\dots,z_n; z_2)]$$
$$\quad + \mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z, z_2,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1, z, z_3,\dots,z_n; z_2)]$$
$$\quad + \mathbb{E}[\ell(z_1,\dots,z_n; z)\,\ell(z_1, z, z_3,\dots,z_n; z_2) - \ell(z_1,\dots,z_n; z)\,\ell(z_1,\dots,z_n; z_2)]$$
$$\le 3M\beta_{\mathrm{cvr}}(n).$$

The first equality follows by renaming $z_1 \leftrightarrow z$ in the first term. Again, each of the
three terms in the sum above can be bounded by $M\beta_{\mathrm{cvr}}(n)$. □


4.3.  Empirical  error  (resubstitution)  estimate 

Mukherjee et al. [10] considered the “removal” version of the CV stability defined
in Sec. 4.2, the motivation being that the addition of a new point $z'$ complicates
the cross-validation nature of the stability. Another motivation is the fact that
$\ell(z_2,\dots,z_n; z_1) - \ell(z_1,\dots,z_n; z_1)$ is non-negative for Empirical Risk Minimization.
It turns out that this “removal” version of the CV stability together with Expected
and Empirical Stabilities upper-bound $\mathbb{E}\Psi_n^2$. Following [10], we have the following
definition:

Definition 4.5. CV (Removal) Stability of an algorithm $\mathcal{A}$ is

$$\beta_{\mathrm{cv}}(n) := \mathbb{E}\,|\ell(z_1,\dots,z_n; z_1) - \ell(z_2,\dots,z_n; z_1)|.$$

The following theorem was proven in [10]. Here, we give a version of the proof.

Theorem 4.6.

$$\mathbb{E}\Psi_n^2 \le M\bigl(\beta_{\mathrm{cv}}(n) + 4\beta_{\mathrm{exp}}(n) + 2\beta_{\mathrm{emp}}(n)\bigr) + \frac{M^2}{n}.$$
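Proof. The first term in the decomposition of the second moment $\mathbb{E}\Psi_n^2$ can be bounded as follows:

$$
\begin{aligned}
&\mathbb{E}\big[\ell(z_1, \ldots, z_n; z)\,\ell(z_1, \ldots, z_n; z') - \ell(z_1, \ldots, z_n; z)\,\ell(z_1, \ldots, z_n; z_1)\big] \\
&\quad = \mathbb{E}\big[\ell(z', z_2, \ldots, z_n; z)\,\ell(z', z_2, \ldots, z_n; z_1) - \ell(z_1, \ldots, z_n; z)\,\ell(z_1, \ldots, z_n; z_1)\big] \\
&\quad = \mathbb{E}\big[\ell(z', z_2, \ldots, z_n; z)\,\mathbb{E}_{z_1}\ell(z', z_2, \ldots, z_n; z_1) - \ell(z', z_2, \ldots, z_n; z)\,\mathbb{E}_{z_1}\ell(z_2, \ldots, z_n; z_1)\big] \\
&\qquad + \mathbb{E}\big[\mathbb{E}_z\ell(z', z_2, \ldots, z_n; z)\,\ell(z_2, \ldots, z_n; z_1) - \mathbb{E}_z\ell(z_2, \ldots, z_n; z)\,\ell(z_2, \ldots, z_n; z_1)\big] \\
&\qquad + \mathbb{E}\big[\mathbb{E}_z\ell(z_2, \ldots, z_n; z)\,\ell(z_2, \ldots, z_n; z_1) - \mathbb{E}_z\ell(z_1, \ldots, z_n; z)\,\ell(z_2, \ldots, z_n; z_1)\big] \\
&\qquad + \mathbb{E}\big[\ell(z_1, \ldots, z_n; z)\,\ell(z_2, \ldots, z_n; z_1) - \ell(z_1, \ldots, z_n; z)\,\ell(z_1, \ldots, z_n; z_1)\big] \\
&\quad \le M\big(3\beta_{\exp}(n) + \beta_{\mathrm{cv}}(n)\big).
\end{aligned}
$$

The first equality follows by renaming $z_1 \leftrightarrow z'$ in the first term. In the sum above, the first three terms are each bounded by $M\beta_{\exp}(n)$, while the last one is bounded by $M\beta_{\mathrm{cv}}(n)$. Since the Expected (and Empirical) Error Stability has been defined in Sec. 3.2 as expectation of a square, we used the fact that $\mathbb{E}|X| \le (\mathbb{E}X^2)^{1/2}$.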

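The second term in the decomposition of $\mathbb{E}\Psi_n^2$ is bounded as follows:

$$
\begin{aligned}
&\mathbb{E}\Big[\Big(\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big)^2 - \mathbb{E}_z\ell(z_1, \ldots, z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] \\
&\quad = \mathbb{E}\Big[\ell(z_1, \ldots, z_n; z_1)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i) - \ell(z_1, \ldots, z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] \\
&\quad = \mathbb{E}\Big[\ell(z_1, \ldots, z_n; z_1)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i) - \ell(z_2, \ldots, z_n; z_1)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] \\
&\qquad + \mathbb{E}\Big[\ell(z_2, \ldots, z_n; z_1)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i) - \ell(z_2, \ldots, z_n; z)\,\frac{1}{n-1}\sum_{i=2}^{n}\ell(z_2, \ldots, z_n; z_i)\Big] \\
&\qquad + \mathbb{E}\Big[\ell(z_2, \ldots, z_n; z)\,\frac{1}{n-1}\sum_{i=2}^{n}\ell(z_2, \ldots, z_n; z_i) - \ell(z_2, \ldots, z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] \\
&\qquad + \mathbb{E}\Big[\mathbb{E}_z\ell(z_2, \ldots, z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i) - \mathbb{E}_z\ell(z_1, \ldots, z_n; z)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] \\
&\quad \le M\big(\beta_{\mathrm{cv}}(n) + 2\beta_{\mathrm{emp}}(n) + \beta_{\exp}(n)\big).
\end{aligned}
$$

The first equality follows by symmetry:

$$\mathbb{E}\Big[\ell(z_1, \ldots, z_n; z_k)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big] = \mathbb{E}\Big[\ell(z_1, \ldots, z_n; z_m)\,\frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i)\Big]$$

for all $k, m$. The first term in the sum above is bounded by $M\beta_{\mathrm{cv}}(n)$. The second term is bounded by $M\beta_{\mathrm{emp}}(n)$ (after renaming $z_1 \leftrightarrow z$). The third term is also bounded by $M\beta_{\mathrm{emp}}(n)$ and the last term by $M\beta_{\exp}(n)$. □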
4.4.  Resubstitution  estimate  for  the  Empirical  Risk  Minimization  algorithm

It turns out that for the ERM algorithm, $\Psi_n$ is “almost positive”. Intuitively, if one minimizes the empirical error, then the expected error is likely to be larger than the empirical estimate. Since $\Psi_n$ is “almost positive”, $\mathbb{E}\Psi_n \to 0$ implies $|\Psi_n| \xrightarrow{P} 0$. We now give a formal proof of this reasoning.

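Recall that an ERM algorithm searches in the function space $\mathcal{F}$. Let

$$f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}_z\ell(f; z)$$

be the minimizer of the expected error.^b Consider the shifted loss class

$$\ell'(\mathcal{F}) = \{\ell'(f; \cdot) = \ell(f; \cdot) - \ell(f^*; \cdot) : f \in \mathcal{F}\}$$

and note that $\mathbb{E}_z\ell'(f; z) \ge 0$ for any $f \in \mathcal{F}$. Trivially, if $\ell(z_1, \ldots, z_n; \cdot)$ is an empirical minimizer over the loss class $\ell(\mathcal{F})$, then $\ell'(z_1, \ldots, z_n; \cdot) = \ell(z_1, \ldots, z_n; \cdot) - \ell(f^*; \cdot)$ is an empirical minimizer over the shifted loss class $\ell'(\mathcal{F})$, and

$$
\begin{aligned}
&\mathbb{E}_z\ell'(z_1, \ldots, z_n; z) - \frac{1}{n}\sum_{i=1}^{n}\ell'(z_1, \ldots, z_n; z_i) \\
&\quad = \mathbb{E}_z\ell(z_1, \ldots, z_n; z) - \frac{1}{n}\sum_{i=1}^{n}\ell(z_1, \ldots, z_n; z_i) - \Big(\mathbb{E}_z\ell(f^*; z) - \frac{1}{n}\sum_{i=1}^{n}\ell(f^*; z_i)\Big).
\end{aligned}
$$

Note that $\frac{1}{n}\sum_{i=1}^{n}\ell'(z_1, \ldots, z_n; z_i) \le 0$ because $\ell'(\mathcal{F})$ contains the zero function. Therefore, the left-hand side is non-negative and the second term on the right-hand side is small with high probability because $f^*$ is non-random. By Hoeffding's inequality, we have

$$P\big(\Psi_n(z_1, \ldots, z_n) < -\varepsilon\big) \le P\Big(\mathbb{E}_z\ell(f^*; z) - \frac{1}{n}\sum_{i=1}^{n}\ell(f^*; z_i) < -\varepsilon\Big) \le e^{-2n\varepsilon^2/M^2}.$$

Therefore,

$$\mathbb{E}|\Psi_n| \le \mathbb{E}\Psi_n + 2\varepsilon + 2Me^{-2n\varepsilon^2/M^2}.$$

If $\mathbb{E}\Psi_n \to 0$, the right-hand side can be made arbitrarily small for large enough $n$, thus proving $\mathbb{E}|\Psi_n| \to 0$. Clearly, $\mathbb{E}\Psi_n \to 0$ whenever $\mathbb{E}|\Psi_n| \to 0$. Hence, we have the following theorem:

Theorem 4.7. For Empirical Risk Minimization, $\beta_{\mathrm{bias}}(n) \to 0$ is equivalent to $|\Psi_n| \xrightarrow{P} 0$.

Remark 4.8. With this approach, the rate of convergence of $I_{\mathrm{emp}}(z_1, \ldots, z_n)$ to $I_{\exp}(z_1, \ldots, z_n)$ is limited by the rate of convergence of $\frac{1}{n}\sum_{i=1}^{n}\ell(f^*; z_i)$ to $\mathbb{E}_z\ell(f^*; z)$, which is $O(n^{-1/2})$ without further assumptions.

^b If the minimizer does not exist, we consider an $\varepsilon$-minimizer.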
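The bound $\mathbb{E}|\Psi_n| \le \mathbb{E}\Psi_n + 2\varepsilon + 2Me^{-2n\varepsilon^2/M^2}$ holds for every $\varepsilon > 0$, so the free parameter can be tuned for each $n$. The sketch below does this with a plain grid search; the function name `abs_deviation_bound` and the grid resolution are illustrative assumptions, not part of the original argument.

```python
import math

def abs_deviation_bound(mean_psi, n, M=1.0, grid=1000):
    """Minimize E Psi_n + 2*eps + 2*M*exp(-2*n*eps^2 / M^2) over eps in (0, M]."""
    best = float("inf")
    for k in range(1, grid + 1):
        eps = M * k / grid
        value = mean_psi + 2 * eps + 2 * M * math.exp(-2 * n * eps ** 2 / M ** 2)
        best = min(best, value)
    return best

# With E Psi_n = 0 the optimized bound shrinks as n grows.
bounds = [abs_deviation_bound(0.0, n) for n in (10**2, 10**4, 10**6)]
```

Because the Hoeffding term only controls deviations of order $n^{-1/2}$, the optimized value decays slowly in $n$, in line with Remark 4.8.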
For ERM, one can show that $|I_{\mathrm{emp}}(z_1, \ldots, z_n) - I_{\mathrm{emp}}(z_2, \ldots, z_n)| \le M/n$. Hence, a “removal” version of Average Stability is closely related to Average Stability:

$$
\begin{aligned}
\mathbb{E}\big(\ell(z_2, \ldots, z_n; z_1) - \ell(z_1, \ldots, z_n; z_1)\big) &= \mathbb{E}\big(I_{\exp}(z_2, \ldots, z_n) - I_{\mathrm{emp}}(z_1, \ldots, z_n)\big) \\
&= \beta_{\mathrm{bias}}(n-1) + \mathbb{E}\big(I_{\mathrm{emp}}(z_2, \ldots, z_n) - I_{\mathrm{emp}}(z_1, \ldots, z_n)\big).
\end{aligned}
$$

Thus, $\mathbb{E}\big(\ell(z_2, \ldots, z_n; z_1) - \ell(z_1, \ldots, z_n; z_1)\big) \to 0$ is also equivalent to $|\Psi_n| \xrightarrow{P} 0$. Furthermore, one can show that

$$\ell(z_2, \ldots, z_n; z_1) - \ell(z_1, \ldots, z_n; z_1) \ge 0$$

for ERM (see [10]), and so CV (Removal) Stability, defined in Sec. 4.3, is equal to the above “removal” version of Average Stability. Hence, $\beta_{\mathrm{cv}}(n) \to 0$ is equivalent to $|\Psi_n| \xrightarrow{P} 0$.
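Two facts about ERM are used in this passage: the empirical error of the minimizer changes by at most $M/n$ when one point is removed, and the loss at $z_1$ does not decrease when $z_1$ is removed from the training sample. The following sketch checks both numerically on random samples. The finite class of constant predictors, the absolute loss, and the names `erm`, `empirical_error`, and `check_erm_removal` are assumptions chosen only to keep the example small.

```python
import random

M = 1.0  # losses below take values in [0, M]

def empirical_error(f, sample):
    """Average absolute loss |f - z| of the constant predictor f on the sample."""
    return sum(abs(f - z) for z in sample) / len(sample)

def erm(sample, function_class):
    """Return an empirical risk minimizer over a finite class of constants."""
    return min(function_class, key=lambda f: empirical_error(f, sample))

def check_erm_removal(n, trials=200, seed=1):
    """Check the two ERM facts on random samples drawn from [0, 1]."""
    rng = random.Random(seed)
    function_class = [k / 10 for k in range(11)]  # hypothetical finite class
    tol = 1e-12  # slack for floating-point rounding
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        f_full = erm(sample, function_class)          # trained on z_1..z_n
        f_removed = erm(sample[1:], function_class)   # trained on z_2..z_n
        i_emp_full = empirical_error(f_full, sample)
        i_emp_removed = empirical_error(f_removed, sample[1:])
        # |I_emp(z_1..z_n) - I_emp(z_2..z_n)| <= M / n
        if abs(i_emp_full - i_emp_removed) > M / n + tol:
            return False
        # The loss at z_1 does not decrease when z_1 is left out of training.
        if abs(f_removed - sample[0]) - abs(f_full - sample[0]) < -tol:
            return False
    return True
```

Both properties follow from the minimizing property alone, so the check is expected to pass for any tie-breaking rule in `erm`.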

Since Empirical Risk Minimization over a uniform Glivenko-Cantelli class implies that $|\Psi_n| \xrightarrow{P} 0$, it also implies that $\beta_{\mathrm{bias}}(n) \to 0$ and $\beta_{\mathrm{cv}}(n) \to 0$. Thus, ERM over a UGC class is stable in these regards. By using techniques from Empirical Process Theory, it can be shown (see [3]) that for ERM over a smaller family of classes, called Donsker classes, a much stronger stability in $L_1$ norm (see Definition 4.1) holds: $\beta_1(n) \to 0$. Donsker classes are classes of functions satisfying the Central Limit Theorem, and for binary classes of functions this is equivalent to finiteness of the VC dimension.


5.  Rates  of  Convergence 

Previous sections focused on finding rather weak conditions for proving $\Psi_n \xrightarrow{P} 0$ and $\Phi_n \xrightarrow{P} 0$ via Markov’s inequality. With stronger notions of stability, it is possible to use more sophisticated inequalities, which is the focus of this section.


5.1.  Uniform  stability 


Uniform Stability (see Definition 2.1) is a very strong notion, and we would not expect, in general, that $\beta_\infty(n) \to 0$. Surprisingly, for Tikhonov Regularization algorithms

$$A(z_1, \ldots, z_n) = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\ell(f; z_i) + \lambda\|f\|_K^2,$$

it can be shown [2] that

$$\beta_\infty(n) \le \frac{L^2\kappa^2}{2\lambda n},$$

where $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) with kernel $K$, $K(x, x) \le \kappa^2 < \infty$ for all $x \in \mathcal{X}$, and $L$ is a Lipschitz constant relating norms between functions $f \in \mathcal{F}$ to norms between loss functions $\ell \in \ell(\mathcal{F})$.
Clearly, $\beta_\infty$ dominates all stabilities discussed in the previous sections, and so can be used to bound the mean and variance of the estimators. For this strong stability, a more powerful concentration inequality can be used instead of Markov’s inequality. McDiarmid’s bounded difference inequality states that if a function of many random variables does not change much when one variable is changed, then the function is almost a constant. This is exactly what we need to bound $\Psi_n$ or $\Phi_n$.

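Theorem 5.1 (McDiarmid, [9]). Let $\xi : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function, $\Gamma = \xi(z_1, \ldots, z_n)$, $\Gamma' = \xi(z_1, \ldots, z_i', \ldots, z_n)$, where $z_1, \ldots, z_n, z_1', \ldots, z_n'$ are i.i.d. random variables. If for all $i$,

$$\sup |\Gamma - \Gamma'| \le \beta_n, \tag{5.1}$$

then for any $\varepsilon > 0$,

$$P(|\Gamma - \mathbb{E}\Gamma| > \varepsilon) \le 2\exp\Big(-\frac{2\varepsilon^2}{n\beta_n^2}\Big).$$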
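To see the bounded-difference inequality at work, the sketch below compares a simulated tail probability with the bound for the simplest possible statistic, the mean of $n$ Uniform[0, 1] variables, for which changing one coordinate moves the value by at most $\beta_n = 1/n$. The example assumes the standard form of the inequality, $2\exp(-2\varepsilon^2/(n\beta_n^2))$; the function names and the choice of statistic are illustrative only.

```python
import math
import random

def mcdiarmid_bound(eps, n, beta_n):
    """Two-sided bounded-difference bound 2*exp(-2*eps^2 / (n*beta_n^2))."""
    return 2 * math.exp(-2 * eps ** 2 / (n * beta_n ** 2))

def empirical_tail(eps, n, trials=5000, seed=0):
    """Fraction of trials where the mean of n Uniform[0,1] draws is off 1/2 by more than eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            hits += 1
    return hits / trials

n, eps = 100, 0.1
tail = empirical_tail(eps, n)                       # simulated P(|mean - 1/2| > eps)
bound = mcdiarmid_bound(eps, n, beta_n=1.0 / n)     # equals 2*exp(-2*n*eps^2)
```

The simulated tail is far below the bound here, which is expected: the inequality only uses the worst-case change per coordinate and ignores the actual variance.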

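Bousquet and Elisseeff [2] applied this inequality to $\Gamma = \Psi_n$:

$$
\begin{aligned}
&|\Psi_n(z_1, \ldots, z_n) - \Psi_n(z, z_2, \ldots, z_n)| \\
&\quad \le |I_{\mathrm{emp}}(z_1, \ldots, z_n) - I_{\mathrm{emp}}(z, z_2, \ldots, z_n)| + |I_{\exp}(z_1, \ldots, z_n) - I_{\exp}(z, z_2, \ldots, z_n)| \\
&\quad \le \frac{1}{n}\,|\ell(z_1, \ldots, z_n; z_1) - \ell(z, z_2, \ldots, z_n; z)| \\
&\qquad + \frac{1}{n}\sum_{j=2}^{n}|\ell(z_1, \ldots, z_n; z_j) - \ell(z, z_2, \ldots, z_n; z_j)| \\
&\qquad + \mathbb{E}_{z'}|\ell(z_1, \ldots, z_n; z') - \ell(z, z_2, \ldots, z_n; z')| \\
&\quad \le 2\beta_\infty(n) + \frac{M}{n} =: \beta_n.
\end{aligned}
$$

If $\beta_\infty(n) = o(n^{-1/2})$, McDiarmid’s inequality shows that $\Psi_n$ is exponentially concentrated around $\mathbb{E}\Psi_n$, which is also small:

$$\mathbb{E}\Psi_n = \beta_{\mathrm{bias}}(n) \le \beta_\infty(n).$$

Therefore,

$$\forall \varepsilon > 0, \quad P\big(\Psi_n > \beta_\infty(n) + \varepsilon\big) \le 2\exp\Big(-\frac{2n\varepsilon^2}{(2n\beta_\infty(n) + M)^2}\Big).$$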
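The chain above can be turned into a small calculator: plug the Tikhonov bound $\beta_\infty(n) \le L^2\kappa^2/(2\lambda n)$ into $\beta_n = 2\beta_\infty(n) + M/n$ and evaluate the bounded-difference bound. This is a sketch under stated assumptions: it uses the standard constant $2\exp(-2\varepsilon^2/(n\beta_n^2))$ for McDiarmid's inequality, and the function names and default values of `L`, `kappa`, and `M` are hypothetical.

```python
import math

def tikhonov_uniform_stability(n, lam, L=1.0, kappa=1.0):
    """Upper bound L^2 kappa^2 / (2 lam n) on beta_inf(n)."""
    return (L ** 2) * (kappa ** 2) / (2.0 * lam * n)

def bounded_difference(n, beta_inf, M=1.0):
    """beta_n = 2*beta_inf(n) + M/n, the bounded difference of Psi_n."""
    return 2.0 * beta_inf + M / n

def deviation_bound(eps, n, lam, M=1.0, L=1.0, kappa=1.0):
    """Bound on P(Psi_n > beta_inf(n) + eps), capped at 1."""
    beta_inf = tikhonov_uniform_stability(n, lam, L, kappa)
    beta_n = bounded_difference(n, beta_inf, M)
    return min(1.0, 2.0 * math.exp(-2.0 * eps ** 2 / (n * beta_n ** 2)))

# Example: lam = 1, eps = 0.1; the bound is vacuous at n = 100, small at n = 1000.
small_n = deviation_bound(0.1, 100, lam=1.0)
large_n = deviation_bound(0.1, 1000, lam=1.0)
```

With $\lambda$ fixed, $\beta_n$ is of order $1/n$ and the bound decays exponentially in $n$; if $\lambda$ shrinks with $n$, the decay slows accordingly.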

Notice that for ERM, $|I_{\mathrm{emp}}(z_1, \ldots, z_n) - I_{\mathrm{emp}}(z, z_2, \ldots, z_n)| \le M/n$, and so it is enough to require $\beta_{\mathrm{bias}} \to 0$ and $|I_{\exp}(z_1, \ldots, z_n) - I_{\exp}(z, z_2, \ldots, z_n)| = o(n^{-1/2})$ to get exponential bounds. The last requirement is strong, as it requires expected errors on similar samples to be close for every sample. The next section deals with “almost-everywhere” stabilities (see [8]), i.e. when a stability quantity is small for most samples.
5.2.  Extending  McDiarmid’s  inequality

As one extreme, if we know that $\beta_\infty(n) = o(n^{-1/2})$, we can use the exponential McDiarmid’s inequality. As the other extreme, if we only have information about averages $\beta_{\mathrm{emp}}$ and $\beta_{\exp}$, we are forced to use the second moment and Chebyshev’s or Markov’s inequality. What happens in between these extremes? What if we know more about the random variables $I_{\mathrm{emp}}(z_1, \ldots, z_n) - I_{\mathrm{emp}}(z, z_2, \ldots, z_n)$? One example is the case when we know that these random variables are almost always small. Unfortunately, the assumptions of McDiarmid’s inequality are no longer satisfied, so other ways of deriving exponential bounds are needed. This section elaborates on this situation.

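Assume that for a given $\beta_n$, a measurable function $\xi : \mathcal{Z}^n \to [-M, M]$ satisfies the bounded difference condition (5.1) on a subset $G \subseteq \mathcal{Z}^n$ of measure $1 - \delta_n$, while

$$\forall (z_1, \ldots, z_n) \in \bar{G}, \ \exists z_i' \in \mathcal{Z} \ \text{s.t.} \ \beta_n < |\xi(z_1, \ldots, z_n) - \xi(z_1, \ldots, z_i', \ldots, z_n)| \le 2M,$$

where $\bar{G}$ is the complement of the subset $G$. Again, denote $\Gamma = \xi(z_1, \ldots, z_n)$, $\Gamma' = \xi(z_1, \ldots, z_i', \ldots, z_n)$. A simple application of the Efron-Stein inequality shows that

$$
\begin{aligned}
\mathrm{var}(\Gamma) &\le \frac{n}{2}\,\mathbb{E}\big(\xi(z_1, \ldots, z_n) - \xi(z, z_2, \ldots, z_n)\big)^2 \\
&= \frac{n}{2}\,\mathbb{E}\Big[I_{(z_1, \ldots, z_n) \in G}\big(\xi(z_1, \ldots, z_n) - \xi(z, z_2, \ldots, z_n)\big)^2\Big] \\
&\quad + \frac{n}{2}\,\mathbb{E}\Big[I_{(z_1, \ldots, z_n) \in \bar{G}}\big(\xi(z_1, \ldots, z_n) - \xi(z, z_2, \ldots, z_n)\big)^2\Big] \\
&\le \frac{n}{2}\big(\beta_n^2 + 4M^2\delta_n\big).
\end{aligned}
\tag{5.2}
$$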
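Combined with Chebyshev's inequality, the variance bound (5.2) gives the polynomial tail bound $P(|\Gamma - \mathbb{E}\Gamma| > \varepsilon) \le n(\beta_n^2 + 4M^2\delta_n)/(2\varepsilon^2)$. The sketch below simply evaluates it; the function names and the example values of $\beta_n$ and $\delta_n$ are assumptions for illustration.

```python
import math

def variance_bound(n, beta_n, delta_n, M=1.0):
    """Efron-Stein bound (n/2) * (beta_n^2 + 4 M^2 delta_n) on var(Gamma)."""
    return 0.5 * n * (beta_n ** 2 + 4.0 * M ** 2 * delta_n)

def chebyshev_tail_bound(eps, n, beta_n, delta_n, M=1.0):
    """Chebyshev bound on P(|Gamma - E Gamma| > eps), capped at 1."""
    return min(1.0, variance_bound(n, beta_n, delta_n, M) / eps ** 2)

# Example: beta_n = 1/n and delta_n = exp(-sqrt(n)) with n = 10_000.
n = 10_000
poly_bound = chebyshev_tail_bound(0.1, n, beta_n=1.0 / n, delta_n=math.exp(-math.sqrt(n)))
```

Even with an exponentially small $\delta_n$, the resulting bound decays only polynomially in $n$, which is what motivates the exponential extensions discussed next.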


This leads to a polynomial bound on $P(|\Gamma - \mathbb{E}\Gamma| > \varepsilon)$. Kutin and Niyogi [7,8] proved an inequality which is exponential when $\delta_n$ decays exponentially with $n$, thus extending McDiarmid’s inequality to incorporate a small possibility of a large jump of $\xi$. A more general version of their bound is the following:

Theorem 5.2 (Kutin and Niyogi [8]). Assume $\xi : \mathcal{Z}^n \to [-M, M]$ satisfies the bounded difference condition (5.1) on a set of measure $1 - \delta_n$ and denote $\Gamma = \xi(z_1, \ldots, z_n)$. Then, for any $\varepsilon > 0$,

$$P(|\Gamma - \mathbb{E}\Gamma| > \varepsilon) \le 2\exp\Big(-\frac{\varepsilon^2}{8n\beta_n^2}\Big) + \frac{nM\delta_n}{\beta_n}. \tag{5.3}$$

Note that the bound tightens only if $\beta_n = o(n^{-1/2})$ and $\delta_n/\beta_n = o(n^{-1})$. Furthermore, the bound is exponential only if $\delta_n$ decays exponentially.^c

While the variance bound in (5.2) is written in terms of the second moment, we can use powerful moment inequalities, recently developed by Boucheron et al. [1], to bound the $q$th moment of $\Gamma$. Moreover, $q$ can be optimized to get the tightest bounds.^d

^c By exponential rate, we mean decay $o(\exp(-n^r))$ for a fixed $r > 0$.

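Define random variables $V_+$ and $V_-$ as

$$V_+ = \mathbb{E}\Big[\sum_{i=1}^{n}(\Gamma - \Gamma')^2 I_{\Gamma > \Gamma'} \,\Big|\, z_1, \ldots, z_n\Big], \qquad V_- = \mathbb{E}\Big[\sum_{i=1}^{n}(\Gamma - \Gamma')^2 I_{\Gamma < \Gamma'} \,\Big|\, z_1, \ldots, z_n\Big].$$

Theorem 5.3 (Boucheron et al. [1]). For $\xi : \mathcal{Z}^n \to \mathbb{R}$, $\Gamma = \xi(z_1, \ldots, z_n)$, and any $q \ge 2$,

$$\|(\Gamma - \mathbb{E}\Gamma)_+\|_q \le \sqrt{2\kappa q}\,\big\|\sqrt{V_+}\big\|_q \quad \text{and} \quad \|(\Gamma - \mathbb{E}\Gamma)_-\|_q \le \sqrt{2\kappa q}\,\big\|\sqrt{V_-}\big\|_q,$$

where $x_+ = \max(0, x)$ and $\kappa \approx 1.271$ is a constant.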


This  result  leads  directly  to  the  following  theorem: 


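Theorem 5.4. Assume $\xi : \mathcal{Z}^n \to [-M, M]$ satisfies the bounded difference condition (5.1) on a set of measure $1 - \delta_n$, and denote $\Gamma = \xi(z_1, \ldots, z_n)$. Then for any $q \ge 2$ and $\varepsilon > 0$,

$$P(\Gamma - \mathbb{E}\Gamma > \varepsilon) \le \frac{(nq)^{q/2}\big((2\kappa)^{q/2}\beta_n^q + (2M)^q\delta_n\big)}{\varepsilon^q},$$

where $\kappa \approx 1.271$.

Proof.

$$\mathbb{E}V_+^{q/2} = \mathbb{E}\big\{I_G V_+^{q/2} + I_{\bar{G}} V_+^{q/2}\big\} \le (n\beta_n^2)^{q/2} + (n(2M)^2)^{q/2}\delta_n.$$

By Theorem 5.3,

$$\mathbb{E}(\Gamma - \mathbb{E}\Gamma)_+^q \le (2\kappa q)^{q/2}\,\mathbb{E}V_+^{q/2} \le (2\kappa nq\beta_n^2)^{q/2} + (nq(2M)^2)^{q/2}\delta_n.$$

Hence,

$$P(\Gamma - \mathbb{E}\Gamma > \varepsilon) \le \frac{\mathbb{E}(\Gamma - \mathbb{E}\Gamma)_+^q}{\varepsilon^q} \le \frac{(nq)^{q/2}\big((2\kappa)^{q/2}\beta_n^q + (2M)^q\delta_n\big)}{\varepsilon^q}. \qquad □$$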
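Since the bound of Theorem 5.4 holds for every $q \ge 2$, one can search for the best $q$ numerically. The sketch below evaluates the bound in log-space (to avoid overflow for large $q$) and grid-searches over $q$. The function names, the search range `q_max`, and the grid resolution are assumptions made for this illustration.

```python
import math

KAPPA = 1.271  # constant from Theorem 5.3

def log_moment_bound(q, eps, n, beta_n, delta_n, M=1.0):
    """log of (nq)^{q/2} ((2*kappa)^{q/2} beta_n^q + (2M)^q delta_n) / eps^q."""
    log_common = 0.5 * q * math.log(n * q) - q * math.log(eps)
    terms = [0.5 * q * math.log(2 * KAPPA) + q * math.log(beta_n)]
    if delta_n > 0:
        terms.append(q * math.log(2 * M) + math.log(delta_n))
    top = max(terms)  # log-sum-exp for numerical stability
    return log_common + top + math.log(sum(math.exp(t - top) for t in terms))

def best_moment_bound(eps, n, beta_n, delta_n, M=1.0, q_max=2000.0, steps=4000):
    """Grid-search q in [2, q_max]; return (best q, bound capped at 1)."""
    best_q, best_log = None, float("inf")
    for k in range(steps + 1):
        q = 2.0 + (q_max - 2.0) * k / steps
        value = log_moment_bound(q, eps, n, beta_n, delta_n, M)
        if value < best_log:
            best_q, best_log = q, value
    return best_q, min(1.0, math.exp(best_log))

# Example: n = 1000, beta_n = 1/n, eps = 0.5, with and without exceptional set.
q0, bound0 = best_moment_bound(0.5, 1000, 1.0 / 1000, 0.0)
q1, bound1 = best_moment_bound(0.5, 1000, 1.0 / 1000, math.exp(-50))
```

When $\delta_n > 0$, the second term grows quickly with $q$, so the best $q$ is pulled toward smaller values and the optimized bound is weaker than in the case $\delta_n = 0$.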


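Note that the bound of Theorem 5.4 holds for any $q \ge 2$. To clarify the asymptotic behavior of the bound, assume $\beta_n = n^{-\gamma}$ for some $\gamma > 1/2$, and let $q = \varepsilon^2\beta_n^{-2}n^{-2\gamma+\eta} = \varepsilon^2 n^{\eta}$ for some $\eta$ to be chosen later, $2\gamma - 1 > \eta > 0$. Assume $\delta_n = \exp(-n^{\theta})$ for some $\theta > 0$. The bound of Theorem 5.4 becomes

$$
\begin{aligned}
P(\Gamma - \mathbb{E}\Gamma > \varepsilon) &\le \frac{(nq)^{q/2}\big((2\kappa)^{q/2}\beta_n^q + (2M)^q\delta_n\big)}{\varepsilon^q} \\
&= \Big(\frac{2\kappa nq\beta_n^2}{\varepsilon^2}\Big)^{q/2} + \delta_n\Big(\frac{4M^2nq}{\varepsilon^2}\Big)^{q/2} \\
&= \big(2\kappa n^{1+\eta-2\gamma}\big)^{\frac{\varepsilon^2}{2}n^{\eta}} + \big(4M^2n^{1+\eta}\big)^{\frac{\varepsilon^2}{2}n^{\eta}}\exp(-n^{\theta}) \\
&\le \exp\Big(\big(1 + (1 + \eta - 2\gamma)\log n\big)\frac{\varepsilon^2}{2}n^{\eta}\Big) \\
&\quad + \exp\Big(\big(2\log(2M) + (1 + \eta)\log n\big)\frac{\varepsilon^2}{2}n^{\eta} - n^{\theta}\Big).
\end{aligned}
\tag{5.4}
$$

^d Thanks to Gabor Lugosi for suggesting this method.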

Since $1 + \eta - 2\gamma < 0$, the first term is decaying exponentially with $n$. We can now choose $\eta < \min(\theta, 2\gamma - 1)$ for the second term to decay exponentially. In particular, let us compare our result to the result of Theorem 5.2. With $\delta_n = \exp(-n^{\theta})$ the bound in Eq. (5.3) becomes

$$P(\Gamma - \mathbb{E}\Gamma > \varepsilon) \le \exp\Big(-\frac{\varepsilon^2}{8}n^{2\gamma-1}\Big) + \exp\big((\log M + (\gamma + 1)\log n) - n^{\theta}\big). \tag{5.5}$$

Depending on whether $\theta < 2\gamma - 1$ or not, the first or second term dominates convergence to zero, which coincides exactly with the asymptotic behavior of our bound. In fact, one can verify that the terms in the exponents of bounds (5.4) and (5.5) have the same order.

We have therefore recovered the result^e of Theorem 5.2 for the interesting case $\delta_n = \exp(-n^{\theta})$ by using the moment inequality of Boucheron et al. [1]. Note that the result of Theorem 5.4 is very general and different ways of picking $q$ might prove useful. For instance, if $\delta_n = 0$, i.e. the bounded difference condition (5.1) holds, we can choose $q$ proportional to $\varepsilon^2/(n\beta_n^2)$ to recover McDiarmid’s inequality up to constants.
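For the case $\delta_n = 0$, the optimization over $q$ can be carried out by hand. The short derivation below is one admissible choice of $q$, worked out from the bound of Theorem 5.4 as stated; it is offered as an illustration and is not necessarily the exact choice intended in the text.

```latex
% Bound of Theorem 5.4 with \delta_n = 0:
P(\Gamma - \mathbb{E}\Gamma > \varepsilon)
  \le \Big(\frac{2\kappa n q \beta_n^2}{\varepsilon^2}\Big)^{q/2}
  = \exp\Big(\frac{q}{2}\log\frac{2\kappa n q \beta_n^2}{\varepsilon^2}\Big).

% Setting the derivative of the exponent with respect to q to zero gives
\log\frac{2\kappa n q \beta_n^2}{\varepsilon^2} + 1 = 0
  \quad\Longrightarrow\quad
  q^* = \frac{\varepsilon^2}{2 e \kappa n \beta_n^2}.

% Substituting q^* (admissible whenever q^* \ge 2):
P(\Gamma - \mathbb{E}\Gamma > \varepsilon)
  \le \exp\Big(-\frac{q^*}{2}\Big)
  = \exp\Big(-\frac{\varepsilon^2}{4 e \kappa n \beta_n^2}\Big).
```

The exponent has the same form $-c\,\varepsilon^2/(n\beta_n^2)$ as in McDiarmid's inequality, with a smaller constant $c = 1/(4e\kappa)$.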

Having proven an extension to McDiarmid’s inequality, we can use it in a straightforward way to derive bounds on $P(|\Psi_n| > \varepsilon)$ and $P(|\Phi_n| > \varepsilon)$ when expected and empirical quantities do not change “most of the time” when compared on similar samples (see [8] for examples).

6.  Summary  and  Open  Problems 

We have shown how stability of algorithms provides an alternative to the classical Statistical Learning Theory approach for controlling the behavior of empirical and leave-one-out estimates. The results presented are by no means a complete picture: one can come up with other notions of algorithmic stability suited to the problem. Our goal was to present some results in a common framework and delineate important techniques for proving bounds.

One important (and largely unexplored) area of further research is looking at existing algorithms and proving bounds on their stabilities. For instance, the work of Caponnetto and Rakhlin [3] showed that Empirical Risk Minimization (over certain classes) is $L_1$-stable. It might turn out that other algorithms are stable in this (or even stronger) sense when considered over restricted function classes, which are nevertheless used in practice. Can these results lead to faster learning rates for algorithms?

eThis  gives  an  answer  to  the  open  question  6.2  in  [7]. 


416  A.  Rakhlin,  S.  Mukherjee  &  T.  Poggio 


Adding a regularization term for ERM leads to an extremely stable Tikhonov Regularization algorithm. How can regularization be used to stabilize other algorithms, and how does this affect the bias-variance trade-off of fitting the data versus having a simple solution?

Though the results presented in this paper are theoretical, there is a potential for estimating stability in practice. Can a useful quantity be computed by running the algorithm many times to determine its stability? Can this quantity serve as a measure of the performance of the algorithm?